A Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora

نویسندگان

  • Fatiha Sadat
  • Hervé Déjean
  • Éric Gaussier
چکیده

In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different search strategies. Different translation-based strategies were combined in order to create an optimal translation model, and extract the most effective bilingual lexicon, and therefore extend any dictionary or thesaurus for any specified domain and pair of languages. Evaluation of the proposed translation model showed significant improvements, especially after an integration of hierarchical information contained in UMLS/MeSH thesauri.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction

This paper focuses on exploiting different models and methods in bilingual lexicon extraction, either from parallel or comparable corpora, in specialized domains. First, a special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to combine the different models for bilingual lexicon extraction is prese...

متن کامل

Bilingual Word Embeddings for Bilingual Terminology Extraction from Specialized Comparable Corpora

Bilingual lexicon extraction from comparable corpora is constrained by the small amount of available data when dealing with specialized domains. This aspect penalizes the performance of distributionalbased approaches, which is closely related to the reliability of word’s cooccurrence counts extracted from comparable corpora. A solution to avoid this limitation is to associate external resources...

متن کامل

Iterative Bilingual Lexicon Extraction from Comparable Corpora Using Topic Model and Context Based Methods

In the literature, two main categories of methods have been proposed for bilingual lexicon extraction from comparable corpora, namely topic model and context based methods. In this paper, we present a bilingual lexicon extraction system that is based on a novel combination of these two methods in an iterative process. Our system does not rely on any prior knowledge and the performance can be it...

متن کامل

Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora

Methods dealing with bilingual lexicon extraction from comparable corpora are often based on word co-occurrence observation and are by essence more effective when using large corpora. In most cases, specialized comparable corpora are of small size, and this particularity has a direct impact on bilingual terminology extraction results. In order to overcome insufficient data coverage and to make ...

متن کامل

Bilingual terminology extraction: an approach based on a multilingual thesaurus applicable to comparable corpora

This paper presents several methods for exploiting multiple resources in bilingual lexicon extraction, either from parallel or comparable corpora. First, a special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to optimally combine the different resources for bilingual lexicon extraction is presente...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002